Tool


Executive  Summary 

This  report  presents  testing  and  improvements  to  a  protocol,  rsync,  for  the 
synchronisation  of  similar  data  files  in  different  locations.  Implementation  of  the 
recommendations  provided  here  can  result  in  bandwidth  savings  of  at  least  50%  for 
large  files,  as  well  as  giving  significant  reductions  in  computer  effort,  resulting  in 
smaller  delays. 

When copying a file A at location α to a remote location β it is often the case that A has
much in common with some data file B already stored at β. In this situation rsync may
be used to effectively send File A from α to β in such a way that much less data than
that contained in A is transmitted. Moreover, this is achieved without requiring both
files to be located at either α or β.

Rsync  may  be  particularly  useful  in  synchronising  databases  that  have  had  significant 
disconnections  or  outages,  resulting  in  failure  of  the  existing  synchronisation  scheme. 
In  this  situation  rsync  may  be  used  to  efficiently  synchronise  databases  without  any 
version  control  or  common  reference  point.  Another  major  application  of  rsync  is  in 
maintaining  Web  pages  which  are  regularly  being  changed  at  the  server  and  have  to 
be  synchronised  with  the  clients.  Thus  the  changes  to  client  files  are  identified  and 
only  the  updates  to  files  are  sent  from  the  server.  This  is  achieved  without  the  need  for 
the  server  to  maintain  any  records  of  client  files  or  to  store  old  versions. 

The  data  efficiencies  that  rsync  offers  make  it  an  attractive  choice  for  communications 
channels  of  limited  bandwidth,  which  occur  in  many  forms  of  military 
communications. 

The  specific  contribution  of  this  report  is  to  provide  new  checksum  functions  that  offer 
significant  improvements  over  the  existing  functions,  some  of  which  are  shown  to 
have  close  to  ideal  properties.  We  also  propose  that  checksum  sizes  be  adaptive, 
depending  on  the  file  size,  and  verify  a  model  of  the  required  checksum  size. 


Authors 


Richard  Taylor 

Communications  Division 

Richard  Taylor  is  the  Head  of  the  Network  Integration  Group 
of  the  Defence  Science  and  Technology  Organisation's 
(DSTO)  Communications  Division.  A  PhD  in  Mathematics 
from  the  University  of  Melbourne,  Richard  has  worked  at  the 
Telecom  Research  Laboratories  in  Victoria,  and  has  over  9 
years  experience  in  the  fields  of  communication  reliability 
and  security. 


Rittwik  Jana 

Information  Technology  Division 

Rittwik  Jana  is  a  Professional  Officer  with  the  Intelligence 
Systems  Group  of  the  Defence  Science  and  Technology 
Organisation's  (DSTO)  Information  Technology  Division.  His 
main  research  interests  include  transmission  of  imagery  over  low 
bandwidth  communication  channels.  He  is  currently  pursuing  a 
PhD in telecommunications at the Australian National University.


Mark  Grigg 

Information  Technology  Division 

Mark  Grigg  has  a  B.Eng  (Electronics)  and  a  B.App.Sci.  (Physics) 
from  RMIT,  and  a  PhD  in  Physics  from  the  University  of 
Melbourne.  He  joined  DSTO  in  1994  as  a  Research  Scientist  with 
the  Intelligence  Systems  Group  in  ITD.  His  current  research 
interests  lie  in  the  areas  of  digital  image  coding,  signal  processing, 
and  the  application  of  Web  based  technologies  to  the  access  and 
dissemination  of  information. 


Contents 


1.  INTRODUCTION

2.  HOW RSYNC WORKS

3.  ROLLING CHECKSUMS

4.  CHECKSUM ANALYSIS

5.  SIMULATIONS

6.  ADAPTIVE CHECKSUM SIZE

7.  CONCLUSIONS

8.  REFERENCES


DSTO-TR-0627 


1.  Introduction 


When copying a File A at location α to a remote location β it is often the case that A has
much in common with some data File B already stored at β. In this situation the rsync
protocol [5] may be used to effectively send File A from α to β in such a way that much
less data than that contained in A is transmitted. Moreover, this is achieved without
requiring both files to be located at either α or β. This is particularly useful if the
communication channel is of limited bandwidth, which occurs in many forms of
military communications.

At  the  heart  of  the  rsync  protocol  are  checksums  which  are  used  to  uniquely  identify 
blocks  of  data,  and  so  identify  the  similarities  and  differences  between  A  and  B.  This 
report  analyses  the  strength  of  the  checksums  provided  with  rsync,  and  investigates 
new  checksum  designs  with  improved  strength.  The  analysis  suggests  that  the 
bandwidth  requirement  of  rsync  may  be  considerably  reduced  without  a  significant 
risk  of  failure  due  to  checksum  collisions.  The  potential  use  of  the  rsync  algorithm  is  in 
synchronising  distributed  information  systems.  In  particular  it  may  be  used  to 
synchronise  databases  that  have  had  significant  disconnections,  resulting  in  failure  of 
the  existing  synchronising  scheme,  or  Web  pages  which  are  regularly  being  changed  at 
the  server  and  have  to  be  in  synchronisation  with  the  clients. 

2.  How  Rsync  works 

The algorithm may be better understood with reference to Figure 1 (see [5] for more
details):


Figure 1 - Information flow during rsync operation.
(Source: File A, the current version, on Computer α. Destination: File B, the old
version, on Computer β. Numbered arrows mark the transactions between them.)

The  aim  is  to  update  File  B  (the  old  version)  with  File  A  (the  current  version).  There  are 
three  important  transactions  that  occur  during  the  execution  of  the  algorithm,  as 
shown  by  the  arrows  in  Figure  1. 

•  Step 1: α notifies β that an rsync operation is to be initiated from File A to File B.

•  Step 2: β partitions File B into non-overlapping blocks each of size b bytes. For each
of these blocks a simple 32 bit checksum and a much stronger 128 bit checksum
(MD5, see [2], [3]) are calculated. These checksums are consolidated into a table and
sent back to α.

•  Step 3: α scans through File A and calculates checksums for all blocks of length b
bytes at all offset positions. These checksums are used to determine blocks of data
in File B (in any position) that match blocks in File A. The 32 bit checksums are
calculated and checked first; if a match is found within the received table then the
128 bit checksum is calculated and checked to be surer that the blocks actually
match.

•  Step 4: α sends β a sequence of instructions for constructing a copy of A. Each
instruction is either a reference to a block of data or literal data. Literal data is sent
only for those blocks of A which are different to any of the blocks in B.

The  computationally  intensive  component  of  the  algorithm  is  Step  3,  since  checksums 
are  calculated  and  matches  within  a  table  sought  for  every  byte  offset  in  File  A.  In 
order  to  make  the  computations  feasible,  checksums  that  are  simple  to  compute,  and  in 
particular  can  be  updated  quickly  at  each  new  offset,  are  used. 
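The four steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the rsync implementation: the function names, the `("ref", index)` / `("lit", bytes)` instruction format and the handling of a trailing partial block are assumptions made for the example, and the 32 bit checksum is recomputed from scratch at every offset rather than rolled, so it shows the logic of Step 3 but not its speed.

```python
import hashlib

def weak_checksum(block: bytes) -> int:
    """32 bit checksum S = T + 2^16 * U, computed from scratch (no rolling)."""
    n = len(block)
    t = sum(block) & 0xFFFF
    # The oldest byte has weight n, the newest byte has weight 1.
    u = sum((n - i) * x for i, x in enumerate(block)) & 0xFFFF
    return t | (u << 16)

def make_table(file_b: bytes, b: int) -> dict:
    """Step 2: map weak checksum -> [(MD5 digest, block index)] for File B.
    A trailing partial block of B is ignored (an assumption of this sketch)."""
    table = {}
    for index in range(len(file_b) // b):
        block = file_b[index * b:(index + 1) * b]
        entry = (hashlib.md5(block).digest(), index)
        table.setdefault(weak_checksum(block), []).append(entry)
    return table

def make_instructions(file_a: bytes, table: dict, b: int) -> list:
    """Steps 3 and 4: scan File A at every offset and emit instructions."""
    out, literal, pos = [], bytearray(), 0
    while pos < len(file_a):
        window = file_a[pos:pos + b]
        match = None
        if len(window) == b:
            candidates = table.get(weak_checksum(window), [])
            if candidates:  # weak match: confirm with the strong checksum
                digest = hashlib.md5(window).digest()
                match = next((i for d, i in candidates if d == digest), None)
        if match is None:
            literal.append(file_a[pos])  # no match here: send this byte literally
            pos += 1
        else:
            if literal:
                out.append(("lit", bytes(literal)))
                literal = bytearray()
            out.append(("ref", match))
            pos += b
    if literal:
        out.append(("lit", bytes(literal)))
    return out

def rebuild(file_b: bytes, instructions: list, b: int) -> bytes:
    """What the destination does with the instructions of Step 4."""
    parts = []
    for kind, value in instructions:
        parts.append(file_b[value * b:(value + 1) * b] if kind == "ref" else value)
    return b"".join(parts)
```

Inserting a few bytes into the middle of a file leaves almost every block of the old version reusable, so only a short literal run needs to be sent.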

3.  Rolling  Checksums 


Consider a block of bytes $X_k, \ldots, X_l$. If it is possible to efficiently calculate the checksum
for the block of bytes $X_{k+1}, \ldots, X_{l+1}$ given the checksum for the buffer $X_k, \ldots, X_l$ and the
values of the bytes $X_k$ and $X_{l+1}$, then we say that the checksum has the rolling property.
Let $P \bmod [Q]$ denote the member of the residue class $P$ modulo $Q$ that lies in the range
0 to $Q-1$. In the original paper [5], a simple rolling checksum $S(k,l)$ was used, namely:

$$S(k,l) = T(k,l) + 2^{16}\,U(k,l),$$

where

$$T(k,l) = \Big(\sum_{i=k}^{l} X_i\Big) \bmod [2^{16}], \qquad U(k,l) = \Big(\sum_{i=k}^{l} (l-i+1)\,X_i\Big) \bmod [2^{16}].$$


The rolling property of $S(k,l)$ follows since

$$T(k+1,l+1) = \big(T(k,l) - X_k + X_{l+1}\big) \bmod [2^{16}],$$

$$U(k+1,l+1) = \big(U(k,l) - (l-k+1)\,X_k + T(k+1,l+1)\big) \bmod [2^{16}].$$
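The rolling update of T and U can be checked against the direct definitions with a short script. The function names below are illustrative only; `x_out` stands for the byte $X_k$ leaving the window and `x_in` for the byte $X_{l+1}$ entering it.

```python
M = 1 << 16  # both T and U are reduced modulo 2^16

def t_u_direct(data: bytes, k: int, l: int):
    """T(k,l) and U(k,l) computed directly from their definitions."""
    t = sum(data[k:l + 1]) % M
    u = sum((l - i + 1) * data[i] for i in range(k, l + 1)) % M
    return t, u

def roll(t: int, u: int, b: int, x_out: int, x_in: int):
    """Move a window of b = l - k + 1 bytes one position to the right."""
    t_new = (t - x_out + x_in) % M
    u_new = (u - b * x_out + t_new) % M
    return t_new, u_new

def s_value(t: int, u: int) -> int:
    """The combined 32 bit checksum S = T + 2^16 * U."""
    return t + M * u
```

Each call to `roll` costs a constant number of operations, independent of the block length b, which is what makes the scan over every byte offset in Step 3 feasible.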

We present two families of 16 bit rolling checksums, each with four functions: C1, C2,
C3, C4, and D1, D2, D3, D4. We analyse their strength, and compare it with that of T and
U. A similar analysis of all 6 concatenated pairs within each of the families C1,...,C4 and
D1,...,D4, together with S, is performed to see how well the 16 bit checksums combine.
The two families of functions are examples of the trade-off between simplicity and speed
on the one hand and checksum strength on the other. The C family of functions is simpler
and quicker to compute, but as we shall show the D functions are stronger and combine better.

Define C1, C2, C3, C4 on the data block $X_k, \ldots, X_l$ with $b = l-k+1$ elements by

$$\mathrm{C1}(k,l) = X_l + 2X_{l-1} + 2^2 X_{l-2} + \cdots + 2^{b-1} X_k \bmod [2^{16}-1].$$

Then

$$\mathrm{C1}(k+1,l+1) = X_{l+1} + 2X_l + 2^2 X_{l-1} + \cdots + 2^{b-1} X_{k+1} \bmod [2^{16}-1]$$

$$= 2\,\mathrm{C1}(k,l) + X_{l+1} - 2^b X_k \bmod [2^{16}-1].$$

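Similarly define

$$\mathrm{C2}(k,l) = X_l + 8X_{l-1} + 8^2 X_{l-2} + \cdots + 8^{b-1} X_k \bmod [2^{16}-1],$$

$$\mathrm{C3}(k,l) = X_l + 32X_{l-1} + 32^2 X_{l-2} + \cdots + 32^{b-1} X_k \bmod [2^{16}-1],$$

$$\mathrm{C4}(k,l) = X_l + 128X_{l-1} + 128^2 X_{l-2} + \cdots + 128^{b-1} X_k \bmod [2^{16}-1].$$

Then,

$$\mathrm{C2}(k+1,l+1) = 8\,\mathrm{C2}(k,l) + X_{l+1} - 8^b X_k \bmod [2^{16}-1],$$

$$\mathrm{C3}(k+1,l+1) = 32\,\mathrm{C3}(k,l) + X_{l+1} - 32^b X_k \bmod [2^{16}-1],$$

$$\mathrm{C4}(k+1,l+1) = 128\,\mathrm{C4}(k,l) + X_{l+1} - 128^b X_k \bmod [2^{16}-1].$$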

By choosing $b$ to be a multiple of 16 we have $2^b = 8^b = 32^b = 128^b = 1 \bmod [2^{16}-1]$.
Multiplication by a power of 2 may be evaluated efficiently using the left shift
operation $\ll$. Thus for $b$ a multiple of 16,

$$\mathrm{C1}(k+1,l+1) = (\mathrm{C1}(k,l) \ll 1) + X_{l+1} - X_k \bmod [2^{16}-1],$$

$$\mathrm{C2}(k+1,l+1) = (\mathrm{C2}(k,l) \ll 3) + X_{l+1} - X_k \bmod [2^{16}-1],$$

$$\mathrm{C3}(k+1,l+1) = (\mathrm{C3}(k,l) \ll 5) + X_{l+1} - X_k \bmod [2^{16}-1],$$

$$\mathrm{C4}(k+1,l+1) = (\mathrm{C4}(k,l) \ll 7) + X_{l+1} - X_k \bmod [2^{16}-1].$$
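A minimal sketch of the C family follows, assuming the block length b is a multiple of 16 so that the power of the multiplier attached to the outgoing byte reduces to 1. The names `SHIFTS`, `c_direct` and `c_roll` are illustrative, and Python's `%` operator stands in for the faster reduction described in the text.

```python
M = (1 << 16) - 1  # the common modulus 2^16 - 1 of the C family

# Multipliers 2, 8, 32 and 128 expressed as left shifts.
SHIFTS = {"C1": 1, "C2": 3, "C3": 5, "C4": 7}

def c_direct(block: bytes, shift: int) -> int:
    """Evaluate X_l + r*X_{l-1} + ... + r^(b-1)*X_k mod M with r = 2^shift.
    Horner's rule: the oldest byte X_k is processed first and so ends up
    multiplied by the highest power of r."""
    value = 0
    for x in block:
        value = ((value << shift) + x) % M
    return value

def c_roll(c: int, shift: int, x_out: int, x_in: int) -> int:
    """Rolling update, valid only when the block length is a multiple of 16."""
    return ((c << shift) + x_in - x_out) % M
```

For a block length that is not a multiple of 16 the outgoing byte would have to be multiplied by the multiplier raised to the power b (reduced mod $2^{16}-1$) before being subtracted.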

To evaluate the modulus function efficiently we show how a 32 bit non-negative
integer $x$ may be evaluated mod $2^{16}-1$ (see [1] and [4] for previous uses of these
methods). This can then be used as the basis of evaluating any expression mod $2^{16}-1$.
Let $y$ and $z$ be the top and bottom 16 bits of $x$, respectively. Thus

$$x = y \cdot 2^{16} + z, \quad \text{where } 0 \le y, z \le 2^{16}-1,$$

$$= y\,(2^{16}-1) + y + z$$

$$= (y+z) \bmod (2^{16}-1).$$

To update the C1 function, set

$$x = (\mathrm{C1}(k,l) \ll 1) + X_{l+1} + 65535 - X_k.$$

If $x = y \cdot 2^{16} + z$ then since $\mathrm{C1} \le 2^{16}-2$ and $X_i \le 2^8-1$ it follows that $0 \le y \le 3$. Thus
$0 \le y+z \le 2^{16}$ and so $x \bmod [2^{16}-1]$ is very likely to be simply $y+z$. Similarly for C2, C3
and C4 we have $0 \le y+z \le 2^{16}+6$, $2^{16}+30$, and $2^{16}+126$ respectively.
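The reduction mod $2^{16}-1$ by adding the top and bottom 16 bits can be sketched as below. The function names are illustrative. To stay correct for any non-negative input the sketch simply repeats the fold until the value fits in 16 bits, rather than relying on the specific bounds quoted above.

```python
def fold_mod_65535(x: int) -> int:
    """Reduce a non-negative integer modulo 2^16 - 1 using shift, mask and add.
    Uses x = y*2^16 + z = y*(2^16 - 1) + (y + z), so x and y + z are congruent."""
    while x > 0xFFFF:
        x = (x >> 16) + (x & 0xFFFF)
    # After folding, 0 <= x <= 2^16 - 1; the top value is congruent to 0.
    return 0 if x == 0xFFFF else x

def c1_update(c1: int, x_out: int, x_in: int) -> int:
    """C1 rolling update for a block length that is a multiple of 16.
    Adding 65535 (congruent to 0) keeps the intermediate value non-negative."""
    x = (c1 << 1) + x_in + 65535 - x_out
    return fold_mod_65535(x)
```

The added constant 65535 changes nothing modulo $2^{16}-1$, but it means the subtraction of the outgoing byte can never drive the intermediate value below zero.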


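Define the second family of functions as

$$\mathrm{D1}(k,l) = X_l + 3X_{l-1} + 3^2 X_{l-2} + \cdots + 3^{b-1} X_k \bmod [2^{16}-1],$$

$$\mathrm{D2}(k,l) = X_l + 5X_{l-1} + 5^2 X_{l-2} + \cdots + 5^{b-1} X_k \bmod [2^{16}-3],$$

$$\mathrm{D3}(k,l) = X_l + 7X_{l-1} + 7^2 X_{l-2} + \cdots + 7^{b-1} X_k \bmod [2^{16}-5],$$

$$\mathrm{D4}(k,l) = X_l + 17X_{l-1} + 17^2 X_{l-2} + \cdots + 17^{b-1} X_k \bmod [2^{16}-7].$$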

The corresponding rolling updates may then be evaluated as

$$\mathrm{D1}(k+1,l+1) = 3\,\mathrm{D1}(k,l) + X_{l+1} - 3^b X_k \bmod [2^{16}-1],$$

$$\mathrm{D2}(k+1,l+1) = 5\,\mathrm{D2}(k,l) + X_{l+1} - 5^b X_k \bmod [2^{16}-3],$$

$$\mathrm{D3}(k+1,l+1) = 7\,\mathrm{D3}(k,l) + X_{l+1} - 7^b X_k \bmod [2^{16}-5],$$

$$\mathrm{D4}(k+1,l+1) = 17\,\mathrm{D4}(k,l) + X_{l+1} - 17^b X_k \bmod [2^{16}-7].$$
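The D family can be sketched in the same way. Here the power of the multiplier attached to the outgoing byte does not reduce to 1, so it is computed once per block length. The names `D_PARAMS`, `d_direct` and `d_roller` are illustrative, and `%` is used for the reduction for clarity.

```python
# (multiplier, modulus) for each function of the D family.
D_PARAMS = {
    "D1": (3, (1 << 16) - 1),
    "D2": (5, (1 << 16) - 3),
    "D3": (7, (1 << 16) - 5),
    "D4": (17, (1 << 16) - 7),
}

def d_direct(block: bytes, r: int, m: int) -> int:
    """Evaluate X_l + r*X_{l-1} + ... + r^(b-1)*X_k mod m by Horner's rule."""
    value = 0
    for x in block:
        value = (value * r + x) % m
    return value

def d_roller(r: int, m: int, b: int):
    """Return a rolling-update function for block length b."""
    r_b = pow(r, b, m)  # r^b mod m is fixed for a given b, so compute it once

    def roll(d: int, x_out: int, x_in: int) -> int:
        # x_out is the byte leaving the window, x_in the byte entering it.
        return (r * d + x_in - r_b * x_out) % m

    return roll
```

Unlike the C family, this update places no restriction on the block length, at the cost of one multiplication by the pre-calculated power per offset.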

Multiplication by 3, 5, 7, and 17 can be done with a shift and an add or subtract
operation (e.g. $17x = (x \ll 4) + x$). The powers are fixed for a given $b$ and so can be
pre-calculated. To evaluate the modulus functions, let $y$ and $z$ be the top and bottom 16 bits
of $x$, respectively. Thus

$$x = y \cdot 2^{16} + z, \quad \text{where } 0 \le y, z \le 2^{16}-1,$$

$$= y\,(2^{16}-i) + iy + z$$

$$= (iy+z) \bmod (2^{16}-i).$$

Here again, multiplication by $i$ may be performed with a shift and add as above. To
update D1, set

$$x = 3\,\mathrm{D1}(k,l) + X_{l+1} + (2^{16}-1)(2^8-1) - \big(3^b \bmod [2^{16}-1]\big)\,X_k$$

$$= 3\,\mathrm{D1}(k,l) + X_{l+1} + 16711425 - \big(3^b \bmod [2^{16}-1]\big)\,X_k.$$

It follows that $0 \le 3y+z \le 2^{16}+769$. Similarly for D2, D3, and D4 we have $iy+z \le 2^{16}+2293$,
$2^{16}+1825$ and $2^{16}+4605$, respectively. Thus $x \bmod [2^{16}-i]$ is likely to be simply $iy+z$, and if
not then $x \bmod [2^{16}-i] = iy+z-(2^{16}-i)$.
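The reduction mod $2^{16}-i$ can be sketched as follows. The function names are illustrative. The offset $(2^{16}-i)(2^8-1)$ used for D2, D3 and D4 is an assumption that generalises the constant $(2^{16}-1)(2^8-1)$ given above for D1, and the fold is repeated until the value fits in 16 bits instead of relying on the quoted bounds.

```python
def reduce_mod(x: int, i: int) -> int:
    """Reduce a non-negative x modulo m = 2^16 - i for a small odd i.
    Uses x = y*2^16 + z = y*(2^16 - i) + (i*y + z)."""
    m = (1 << 16) - i
    while x >> 16:                        # fold until x fits in 16 bits
        x = i * (x >> 16) + (x & 0xFFFF)  # strictly decreases x when y >= 1
    if x >= m:                            # now x < 2^16 = m + i: one subtraction
        x -= m
    return x

def d_update(d: int, x_out: int, x_in: int, r: int, i: int, r_b: int) -> int:
    """Rolling update of a D function with multiplier r and modulus 2^16 - i.
    r_b is r^b reduced mod 2^16 - i, pre-calculated for the block length b.
    The offset m * 255 is a multiple of m, so it does not change the result,
    and it is at least r_b * x_out, so x is never negative."""
    m = (1 << 16) - i
    x = r * d + x_in + m * 255 - r_b * x_out
    return reduce_mod(x, i)
```

With `r = 3`, `i = 1` the offset is 65535 * 255 = 16711425, the constant that appears in the D1 update.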

Thus  a  single  multiplication,  together  with  elementary  operations  (add,  subtract,  shift, 
assign)  are  used  in  updating  each  of  the  rolling  functions. 

4.  Checksum  Analysis 

To  test  the  strength  of  checksum  functions  we  construct  a  theoretical  model  for 
evaluating  the  probability  of  checksum  collisions  (matching  checksums  corresponding 
to  different  data  blocks). 

In the operation of rsync, once α has received the list of checksums of the blocks of File
B, it must search File A for any blocks at any offset that match the checksum of some
block of B. The 32 bit rolling checksum for a block of length b is computed for each
byte offset in File A. This is then compared against the table to find any matches. Once
a match has been found, α then sends β the corresponding reference to the data in A.

Since  File  A  and  B  are  assumed  to  be  largely  similar,  and  it  is  in  this  case  that  the 
checksums  are  more  likely  to  fail  we  shall  assume  that  A  and  B  are  the  same  file.  Thus 
we  shall  examine  the  ability  of  the  checksums  to  differentiate  between  boundary 
blocks  and  offset  blocks  within  a  given  file.  Let  Y  be  the  total  number  of  data  bytes  in 
the  file,  so  the  number  of  blocks  in  the  file  is  Y/b.  The  total  number  of  shifts  less  those 
that  lie  on  the  block  boundaries  is  Y-Y/b.  Let  the  expected  number  of  checksum 
matches in which the blocks are different (or False Alarms) be FA. Let n be the number
of bits in the checksum. Assuming that the incidence of boundary blocks that match
some non-boundary block is small in comparison to False Alarms, and that the
checksum has ideal statistical properties in differentiating between different blocks, we
have

$$FA = \frac{(Y/b)\,(Y - Y/b)}{2^n}. \qquad (1)$$


Conversely,  if  we  are  comparing  the  strength  of  different  checksum  functions,  the 
number  of  False  Alarms  FA  can  be  computed  for  particular  files  and  the  effective  bit 
strength  n  of  the  checksum  calculated  from  the  above. 
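As a hedged illustration, suppose the model is $FA = (Y/b)(Y - Y/b)/2^n$, that is, each of the $Y/b$ boundary blocks collides with each of the $Y - Y/b$ offset blocks with probability $2^{-n}$. The effective bit strength then follows by solving for n. The function names are illustrative, and the sample figures are the False Alarm counts reported in the tables of Section 5.

```python
import math

def expected_false_alarms(Y: float, b: float, n: float) -> float:
    """Expected False Alarms for an ideal n bit checksum (assumed model)."""
    return (Y / b) * (Y - Y / b) / 2 ** n

def effective_bits(Y: float, b: float, fa: float) -> float:
    """Effective bit strength n implied by an observed False Alarm count."""
    return math.log2((Y / b) * (Y - Y / b) / fa)

# 1000 blocks of 400 bytes, as in the 16 bit tests:
Y, b = 400 * 1000, 400
print(round(effective_bits(Y, b, 84343), 1))  # function T, random data
print(round(effective_bits(Y, b, 6102), 1))   # function U, random data
```

Under this model the two printed values agree with the 12.2 and 16.0 bits listed for T and U over the random data file.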

5.  Simulations

Checksums were tested with three different files. The first consists of pseudorandom
data, the second consists of the data corresponding to a large map, and the third is a large
tar file generated from a directory of PowerPoint files.

Over the pseudorandom data, the effective bit strengths of all 16 bit checksums are
close to ideal (16 bits), with the exception of function T. This may be explained by
noting that T is a sum of 8 bit integers modulo $2^{16}$ and so each added integer is likely to
change only the lower 8 bits of the checksum. Indeed one would expect that at least
512 bytes need to be added together to reach $2^{16}$, and so bring the modulus into effect.

There  is  some  variation  of  checksum  strengths  over  the  structured  data  tests.  In 
general  however  all  the  functions  perform  well  with  the  exception  of  T. 

In  pairwise  combinations  there  are  significant  differences  in  the  measured  strengths  of 
the  32-bit  checksums  formed,  ranging  from  26.8  to  32.1.  In  general  the  function  S  is  the 
weakest  function,  the  C  combinations  somewhat  stronger  and  the  D  combinations  the 
strongest  of  all.  In  fact  the  D  combination  checksums  have  a  strength  that  is 
consistently close to ideal (32) over the test data (from 31.8 to 32.1).
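A test of this kind can be sketched on a small pseudorandom file. This is an assumed reconstruction of the procedure, not the authors' test harness: the file is compared with itself, every non-boundary offset is checked against the boundary blocks, and a False Alarm is a checksum match between blocks whose contents differ. All names, the seed and the file size are illustrative.

```python
import random

def t_sums(data: bytes, b: int) -> list:
    """Function T (byte sum mod 2^16) at every offset, by rolling."""
    t = sum(data[:b]) % 65536
    sums = [t]
    for k in range(len(data) - b):
        t = (t - data[k] + data[k + b]) % 65536
        sums.append(t)
    return sums

def d1_sums(data: bytes, b: int, r: int = 3, m: int = 65535) -> list:
    """Function D1 at every offset, by rolling with a pre-calculated power."""
    r_b = pow(r, b, m)
    d = 0
    for x in data[:b]:
        d = (d * r + x) % m
    sums = [d]
    for k in range(len(data) - b):
        d = (r * d + data[k + b] - r_b * data[k]) % m
        sums.append(d)
    return sums

def count_false_alarms(data: bytes, b: int, sums: list) -> int:
    """Count checksum matches between offset blocks and differing boundary blocks."""
    starts = {}
    for i in range(0, len(data) - b + 1, b):  # blocks on the boundaries
        starts.setdefault(sums[i], []).append(i)
    fa = 0
    for off in range(len(sums)):
        if off % b == 0:                      # skip the boundary offsets
            continue
        for i in starts.get(sums[off], ()):
            if data[off:off + b] != data[i:i + b]:
                fa += 1
    return fa

rng = random.Random(0)
b = 64
data = bytes(rng.randrange(256) for _ in range(b * 200))  # 200 blocks
fa_t = count_false_alarms(data, b, t_sums(data, b))
fa_d1 = count_false_alarms(data, b, d1_sums(data, b))
# An ideal 16 bit checksum would give about 200 * 12600 / 65536, i.e. roughly 38.
print(fa_t, fa_d1)
```

On random data T should produce far more False Alarms than D1, because the sum of b bytes occupies only a small part of the 16 bit range.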


16 Bit Checksums, Random Data File, Block Size R=400 bytes, 1000 Blocks in File

Checksums    False Alarms, FA    Effective bit strength
T            84343               12.2
U            6102                16.0
C1           5935                16.0
C2           6225                16.0
C3           6160                16.0
C4           6106                16.0
D1           6068                16.0
D2           6018                16.0
D3           5955                16.0
D4           6036                16.0

Table 1. Analysis of 16 bit checksum functions for random data

16 Bit Checksums, Map Data File, Block Size R=400 bytes, 1000 Blocks in File

Checksums    False Alarms, FA    Effective bit strength
T            32071               13.6
U            6121                16.0
C1           6187                16.0
C2           5925                16.0
C3           6073                16.0
C4           6184                16.0
D1           6223                16.0
D2           6209                16.0
D3           6061                16.0
D4           6108                16.0

Table 2. Analysis of 16 bit checksum functions for structured data (map file)


32 Bit Checksums, Map Data File, Block Size R=400 bytes, 50000 Blocks in File

Checksums    False Alarms, FA    Effective bit strength
S=T|U        7007                27.1
C1|C2        705                 30.4
C1|C3        3608                28.0
C1|C4        688                 30.4
C2|C3        676                 30.4
C2|C4        3537                28.1
C3|C4        723                 30.4
D1|D2        236                 32.0
D1|D3        213                 32.1
D1|D4        247                 31.9
D2|D3        238                 32.0
D2|D4        270                 31.8
D3|D4        224                 32.1

Table 3. Analysis of pairwise combinations of 16 bit checksum functions for structured data
(map file)

16 Bit Checksums, PowerPoint Tar File, Block Size R=400 bytes, 1000 Blocks in File

Checksums    False Alarms, FA    Effective bit strength
T            27052               13.9
U            8072                15.6
C1           10098               15.2
C2           7133                15.7
C3           5388                16.2
C4           12069               15.0
D1           6547                15.9
D2           6712                15.9
D3           5817                16.1
D4           8999                15.4

Table 4. Analysis of 16 bit checksum functions for structured data (PowerPoint tar file)


32 Bit Checksums, PowerPoint Tar File, Block Size R=400 bytes, 50000 Blocks in File

Checksums    False Alarms, FA    Effective bit strength
S=T|U        8752                26.8
C1|C2        2193                28.8
C1|C3        2291                28.7
C1|C4        2152                28.8
C2|C3        2167                28.8
C2|C4        2271                28.7
C3|C4        2167                28.8
D1|D2        209                 32.0
D1|D3        211                 32.1
D1|D4        248                 31.9
D2|D3        210                 32.0
D2|D4        249                 31.8
D3|D4        225                 32.0

Table 5. Analysis of pairwise combinations of 16 bit checksum functions for structured data
(PowerPoint tar file)

Adapting the design of the checksums given here to take advantage of computers with
larger word sizes, it would seem plausible that the functions modified by increasing
the moduli should provide correspondingly strong checksums. Thus for 64 bit word
size for example, simply replace the moduli $2^{16}$, $2^{16}-1$, $2^{16}-3$, ... by $2^{32}$, $2^{32}-1$, $2^{32}-3$, ....

6.  Adaptive  Checksum  Size 


From equation (1) we may make predictions about adequate checksum sizes with the
use of checksums with consistent properties. This may be used to minimise the size of
the table generated in Step 2 of rsync and the corresponding bandwidth requirement,
while keeping the probability of a False Alarm relatively low. To support this
approach, Step 3 of rsync should also involve the sending of a single strong checksum
(say 128 bits) over the entire File A (noted in [2]). This allows β to detect that rsync has
failed, in which case the protocol needs to be repeated with some small file changes.

As the number of False Alarms indicated by (1) increases with the square of the file
size, clearly the size of adequate checksums will vary accordingly. Let p denote the
probability of at least one False Alarm occurring in rsync. Then using (1) and assuming
that Files A and B are approximately the same size Y we may bound p by

$$p \le \frac{(Y/b)\,(Y - Y/b)}{2^n}. \qquad (2)$$

For small values of the right hand side of (2) both p and FA approximate the
probability of exactly one False Alarm.

Using $Y - Y/b \approx Y$, we may estimate the number of checkbits CB (in a checksum with
ideal properties) required to meet such p as

$$CB = 2\log_2(Y) + \log_2\big(1/(bp)\big). \qquad (3)$$

For example if $p = 10^{-6}$, $b = 1000$, then CB(Y) may be approximated from (3) as
CB($10^4$ or 10 Kbyte) = 36.5, CB($10^6$ or 1 Mbyte) = 49.8, CB($10^8$ or 100 Mbytes) = 63.1,
CB($10^{10}$ or 10 Gbytes) = 76.4.
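The estimate $CB = 2\log_2(Y) + \log_2(1/(bp))$ is easy to evaluate for a given transfer. The sketch below reproduces the worked figures. The function names, and the idea of rounding the result up to whole bytes per table entry, are assumptions made for illustration.

```python
import math

def check_bits(Y: float, b: float, p: float) -> float:
    """Checkbits needed for file size Y, block size b and failure probability p."""
    return 2 * math.log2(Y) + math.log2(1 / (b * p))

def checksum_bytes(Y: float, b: float, p: float) -> int:
    """Hypothetical helper: round the estimate up to whole bytes per block."""
    return math.ceil(check_bits(Y, b, p) / 8)

for Y in (1e4, 1e6, 1e8, 1e10):
    print(int(Y), round(check_bits(Y, 1000, 1e-6), 1), checksum_bytes(Y, 1000, 1e-6))
```

Each hundredfold increase in file size adds $2\log_2(100)$, or about 13.3, checkbits, so a fixed checksum size is either wasteful for small files or too weak for large ones.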

Thus  the  length  of  checksums  required  to  provide  a  given  level  of  confidence  varies 
significantly  with  the  file  size.  We  therefore  suggest  that  the  checksum  sizes  be 
dynamically  chosen  for  each  rsync  transaction  based  on  equation  (3).  This  will 
significantly  reduce  the  data  transmitted  in  Step  2  of  the  rsync  protocol. 

Another important factor in the computational efficiency of rsync is the use of
non-rolling checksums (such as MD5) as a backup check in Step 3, since each such backup
check requires MD5 to be calculated from scratch over the entire block. This is
important for large files since the incidence of false 32 bit matches rises with the size of
the file. For this reason stronger rolling checksums will also reduce the computational
effort in Step 3 of rsync, especially for large files.

7.  Conclusions 


A  methodology  is  given  with  which  the  checksums  provided  with  the  rsync  protocol 
have  been  tested  and  compared  to  new  checksum  functions  given  in  this  report.  The 
new checksums are shown to be stronger than those used in rsync; moreover, a
particular family of new checksums has a strength that is consistently close to ideal
over  the  test  data.  These  checksums  may  be  updated  for  each  byte  offset  with  just  one 
multiplication  and  elementary  operations  (add,  subtract,  assign,  and  shift).  Based  on 
these  new  checksums  we  suggest  that  rsync  should  dynamically  determine  checksum 
sizes  depending  on  the  file  size.  This  will  significantly  reduce  the  bandwidth 
requirement  of  the  rsync  protocol  while  controlling  the  chance  that  the  protocol  will 
fail  because  of  checksum  collisions  (and  need  to  be  repeated). 